Routing: Adds Gateway Account Property Flag for Dynamic Hedging Control#5829
Routing: Adds Gateway Account Property Flag for Dynamic Hedging Control#5829NaluTripician wants to merge 11 commits into
Conversation
Adds a Gateway-controlled `disableCrossRegionalHedging` account property that
lets on-call engineers dynamically disable PPAF cross-region hedging without
rolling back PPAF entirely. When the flag is true, the SDK clears the
AvailabilityStrategy on the next account-properties refresh; toggling it back
to false restores the customer-configured strategy or the SDK default.
Changes
- AccountProperties: new internal `DisableCrossRegionalHedging` (nullable bool)
bound to the JSON key `disableCrossRegionalHedging`.
- GlobalEndpointManager: extends `OnEnablePartitionLevelFailoverConfigChanged`
to `Action<bool, bool>` and tracks `lastKnownDisableCrossRegionalHedging`
so the event fires when either flag changes.
- DocumentClient:
- Captures the initial gateway flag during `GetInitializationTaskAsync` and
stashes any customer-configured `AvailabilityStrategy` if the flag is
true at startup.
- `InitializePartitionLevelFailoverWithDefaultHedging` short-circuits when
the flag is true.
- `UpdatePartitionLevelFailoverConfigWithAccountRefresh` now reconciles
both PPAF state and the disable flag via `ApplyHedgingStrategyForCurrentState`,
which restores customer strategies when the flag flips back off and skips
stashing SDK-default strategies (regenerated deterministically).
- Tests: 10 new unit tests covering deserialization, init-time skip, customer-
strategy stash/restore, SDK-default rebuild, and client-level override.
- Updates 2 existing serializer contract tests that pin the `AccountProperties`
JSON shape.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
…back Addresses review comments on PR #5829: - #3 Renames shadowing parameters in UpdatePartitionLevelFailoverConfigWithAccountRefresh (isEnabled -> latestIsEnabled, disableCrossRegionalHedging -> latestDisableCrossRegionalHedging) so a future edit dropping a 'this.' prefix cannot silently change semantics. - #6 Adds early-return guard when neither PPAF enablement nor the hedging flag changed, protecting future direct callers from spurious AvailabilityStrategy mutation. - #5 Introduces hedgingStrategyLock and serializes the (flag, stash, ConnectionPolicy.AvailabilityStrategy) mutation sequence in UpdatePartitionLevelFailoverConfigWithAccountRefresh and ApplyHedgingStrategyForCurrentState, so concurrent invocations cannot interleave stash/clear/restore steps. - #11 Emits a TraceWarning when ApplyHedgingStrategyForCurrentState would silently drop a non-default AvailabilityStrategy because a previously-stashed customer strategy is still held. - #8 Replaces reflection-based field access in GatewayHedgingOverrideTests with internal test-only properties (DisableCrossRegionalHedgingForTests, CustomerConfiguredAvailabilityStrategyForTests), so renames are caught at compile time. - #2 / #12 Moves the init-time flag capture and customer-strategy stash from GetInitializationTaskAsync into InitializeGatewayConfigurationReaderAsync, before the GEM background-refresh loop is started and before the change-event handler is subscribed. This eliminates the theoretical race where a refresh-driven event could be overwritten by a subsequent init-time field assignment, and removes the inconsistent null-check style. - #10 Removes unrelated .coding-harness/ entry from .gitignore. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Please add changelog entry as well. |
…iew feedback Addresses the 2026-05-14 review comments from @kundadebdatta on PR #5829: - AccountProperties.DisableCrossRegionalHedging: adds a TODO noting that the JSON property name should move to Constants.Properties once the underlying constant ships in the Microsoft.Azure.Cosmos.Direct package. Also drops the stale `only honored for accounts with PPAF enabled'' phrase from the doc remarks, matching the design clarification that the flag applies to both PPAF and non-PPAF accounts. - DocumentClient.hedgingStrategyLock: expands the field-level comment to call out that the lock is NOT load-bearing under current callers (GEM serializes refreshes; init runs before subscription) and exists specifically for the internal test accessors and as future-proofing for new direct callers. - UpdatePartitionLevelFailoverConfigWithAccountRefresh: expands the comments on the no-change early-return guard and on the unconditional ApplyHedgingStrategyForCurrentState() call to make their rationale explicit (guard exists for direct callers / unit tests; reconcile is intentionally shared between the ppafEnablementChanged and hedgingFlagChanged paths because PPAF toggles also need to install or drop the SDK default). - UpdatePartitionLevelFailoverConfigWithAccountRefresh: adds a matching `Successfully reconciled hedging strategy dynamically'' trace for hedging-flag-only transitions, and moves the PPAF `Successfully updated PPAF configuration dynamically'' trace inside the ppafEnablementChanged branch so each transition logs exactly one terminal trace. - DisableCrossRegionalHedgingForTests / CustomerConfiguredAvailabilityStrategyForTests: expands the comment to explain why these accessors exist instead of mocking AccountProperties (no current mocking seam past GatewayAccountReader.InitializeReaderAsync, which performs real I/O) and notes that they should be removed once such a seam is introduced. - changelog.md: adds an [Unreleased] section with the PR #5829 entry under Added. No behavioral changes -- comments, traces, and the changelog entry only. GatewayHedgingOverrideTests (10 tests) continue to pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Addressed in d9041ca. Added an The release author can rename |
…-gateway-hedging-override # Conflicts: # changelog.md
kushagraThapar
left a comment
There was a problem hiding this comment.
Hi @NaluTripician — thanks for the careful hardening work in commit 816e334fc and the second round in d9041cae, the lock discipline + init-time relocation closed off the most concerning windows, and the changelog entry at HEAD is great. A few collaborative follow-ups from a fresh-eyes pass on commit 95938a5c2. None of these duplicate Debdatta's open N1/N2/N3 (replies on those are tracked separately).
🟡 M1 — Per-request RequestOptions.AvailabilityStrategy may bypass the new operator override
Could you please help me understand the scope of this PR vs the unchecked tasks 4.1 (RequestInvokerHandler enforcement) and 6.3 (per-request test) in openspec/changes/ppaf-dynamic-hedging-control/tasks.md?
Microsoft.Azure.Cosmos/src/Handler/RequestInvokerHandler.cs:132 resolves the per-request strategy as:
AvailabilityStrategy strategy = request.RequestOptions?.AvailabilityStrategy
?? this.client.DocumentClient.ConnectionPolicy.AvailabilityStrategy;So a customer that sets AvailabilityStrategy on RequestOptions per-call would keep speculating cross-region after the gateway flag is true — which seems to contradict the operator-mitigation intent.
Two options to consider, either is fine — mostly want the scope to be unambiguous before merge:
(a) Implement task 4.1 — add an internal DocumentClient.IsHedgingDisabledByGateway accessor (read under hedgingStrategyLock) and have RequestInvokerHandler.AvailabilityStrategy(...) return null when set; or
(b) Explicitly downscope — strike the per-request scenarios from spec.md, mark task 4.1 / 6.3 as out of scope in tasks.md, and note in the PR description that per-request enforcement is a follow-up.
Happy to defer to (b) if that's the agreed scope.
🟡 M4 — Java parity gap (informational; happy to be told there's already a tracking issue)
I am curious whether there is a Java tracking issue for this — could you please help me understand whether Java is intentionally lagging on the gateway-driven knob?
I checked Azure/azure-sdk-for-java main and found:
$ grep -rn --include="*.java" "disableCrossRegionalHedging" \
sdk/cosmos/azure-cosmos/src sdk/cosmos/azure-cosmos-tests/src
(no matches)
$ grep -n "IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF" \
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java
333: private static final String DEFAULT_IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF = "true";
334: private static final String IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF = "COSMOS.IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF";
335: private static final String IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF_VARIABLE = "COSMOS_IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF";
Java currently has the related-but-distinct client-side JVM lever COSMOS.IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF, but no server-driven override. Without one, an operator flipping the gateway flag will get inconsistent behavior across .NET vs Java clients on the same account — which is exactly what this knob seems designed to avoid. Java already has the GEM perPartitionAutomaticFailoverConfigModifier callback plumbing (GlobalEndpointManager.java:60), so mirroring this would be additive rather than new infrastructure. Python has no SDK-default PPAF hedging at all (default availability_strategy=False), so the gap there is largely cosmetic and probably worth a single note in spec.md calling Python out of scope.
Thanks again — happy to chat through any of these if helpful.
Round 2 review feedback (kushagraThapar, kundadebdatta) on PR #5829: * RequestInvokerHandler.AvailabilityStrategy now short-circuits to null when the gateway flag is true, giving the operator override absolute precedence over per-request and client-level AvailabilityStrategy (K-1, K-5). * DocumentClient gains an internal IsHedgingDisabledByGateway accessor that reads the cached flag under hedgingStrategyLock. * GlobalEndpointManager.RefreshDatabaseAccountInternalAsync now guards the hedging change-detection with HasValue (mirrors the existing PPAF guard). An absent property is treated as "no signal" rather than implicit false, so a transient property-drop response cannot silently re-enable hedging during the exact window the operator most wants it disabled. The cached lastKnownDisableCrossRegionalHedging is preserved across null reads (K-2). * Specs/design/proposal updated: the flag now applies to both PPAF and non-PPAF accounts; the only documented bypass is a customer-configured client-level AvailabilityStrategy (K-3). Added Cross-SDK Parity section pointing at the Java tracking issue. * Tests: - GatewayHedgingOverrideTests: RequestInvokerHandler precedence (flag-true returns null even with per-request strategy; flag-false honors the per-request strategy verbatim). - GlobalEndpointManagerTest: property-drop-after-true does NOT fire the change event; explicit-false-after-true DOES fire it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thanks @kushagraThapar — addressed both M1 and M4 in Re: M1 (Per-request
|
…labilityStrategy The cref RequestInvokerHandler.AvailabilityStrategy in DocumentClient.cs could not be resolved because (1) RequestInvokerHandler lives in the Microsoft.Azure.Cosmos.Handlers namespace and is not imported by DocumentClient, and (2) the simple member name 'AvailabilityStrategy' is ambiguous with the public AvailabilityStrategy class. This produced error CS1574 and broke every downstream build job (21 failing CI checks). Qualify the cref with the Handlers namespace and the (RequestMessage) parameter signature so the doc-comment resolver picks the method overload unambiguously. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ree reads Addresses two unresolved review threads from @kundadebdatta on PR #5829: 1. IsHedgingDisabledByGateway was taking a monitor lock on every request through RequestInvokerHandler — contradicting its own "atomic kill-switch" contract and adding uncontended-but-non-zero cost to the SDK hot path (CI benchmarks would not surface this since they run steady-state). 2. InitializePartitionLevelFailoverWithDefaultHedging at line 6963 read disableCrossRegionalHedging without any synchronization, even though it executes after the GEM background-refresh loop is live — a narrow but real window where a refresh-driven write could be missed for lack of an acquire barrier. Both findings collapse to the same fix: - Declare disableCrossRegionalHedging as `private volatile bool` so reads have acquire semantics and writes have release semantics. The hedgingStrategyLock continues to serialize the multi-field mutation sequence (flag + stashed customer strategy + ConnectionPolicy.AvailabilityStrategy) on the refresh / init-stash paths. - Drop the lock from IsHedgingDisabledByGateway; collapse it to an expression-body property returning the volatile field directly. - Drop the lock from the DisableCrossRegionalHedgingForTests getter for consistency; keep it on the setter so test writes happens-before any subsequent reconcile-path mutation observes them. - Rewrite the field-level and hedgingStrategyLock comments to spell out the volatile contract explicitly, so future readers do not reintroduce a lock around the hot read. Adds InitializePartitionLevelFailoverWithDefaultHedging_AfterRefreshDrivenFlagFlip_SkipsApplyingDefaultStrategy to GatewayHedgingOverrideTests, which exercises the exact race scenario the reviewer called out: a refresh-driven UpdatePartitionLevelFailoverConfigWithAccountRefresh(true, true) followed by the init-path call to InitializePartitionLevelFailoverWithDefaultHedging, asserting the SDK default is never installed. Complements the existing synthetic-setter coverage in InitializePartitionLevelFailoverWithDefaultHedging_FlagTrue_SkipsApplyingDefaultStrategy. Validation: Microsoft.Azure.Cosmos.Tests Release run — 2750 passed, 0 new failures. All 13 GatewayHedgingOverrideTests pass (12 prior + 1 new). The single failure on this leg (CosmosClientTests.InvalidKey_ExceptionFullStacktrace) is a pre-existing environment-dependent failure unrelated to this change, verified to reproduce on the pre-edit PR head. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-gateway-hedging-override # Conflicts: # changelog.md
| if (this.disableCrossRegionalHedging) | ||
| { | ||
| DefaultTrace.TraceInformation( | ||
| "DocumentClient: Skipping default PPAF hedging because Gateway property disableCrossRegionalHedging=true"); |
There was a problem hiding this comment.
🟡 Init-path SDK-default install runs outside hedgingStrategyLock — volatile fix only covers the read side
The early-return on this line is a volatile read of disableCrossRegionalHedging, but the rest of InitializePartitionLevelFailoverWithDefaultHedging (the SDK-default install at lines ~7000-7002) runs without hedgingStrategyLock. Interleaving with UpdatePartitionLevelFailoverConfigWithAccountRefresh (which does take the lock) produces this TOCTOU:
- Init thread: volatile read sees flag=
false, falls through past the early-return. - Refresh thread: takes lock, sets flag=
true, runsApplyHedgingStrategyForCurrentState— observesConnectionPolicy.AvailabilityStrategy == null(init hasn't written yet), takes the no-op path, releases lock. - Init thread: writes
ConnectionPolicy.AvailabilityStrategy = SDKDefaultCrossRegionHedgingStrategyForPPAF(...).
Final state: disableCrossRegionalHedging == true and ConnectionPolicy.AvailabilityStrategy != null. The RequestInvokerHandler.IsHedgingDisabledByGateway short-circuit added in this PR still suppresses hedging on the live request path (so the kill-switch contract holds), but the in-process invariant the rest of the file relies on is violated until the next genuine flag transition triggers another ApplyHedgingStrategyForCurrentState call. This is the same call site flagged earlier for the read side; commit 403d509f only addressed half the race.
Fix: wrap the install block (lines ~6980-7003) in lock (this.hedgingStrategyLock) { ... }, or refactor so all mutations of ConnectionPolicy.AvailabilityStrategy funnel through ApplyHedgingStrategyForCurrentState (which already owns the lock).
| this.disableCrossRegionalHedging = accountProperties.DisableCrossRegionalHedging ?? false; | ||
| if (this.disableCrossRegionalHedging && this.ConnectionPolicy.AvailabilityStrategy != null) | ||
| { | ||
| this.customerConfiguredAvailabilityStrategy = this.ConnectionPolicy.AvailabilityStrategy; |
There was a problem hiding this comment.
🟡 Init-time stash assumes anything in ConnectionPolicy.AvailabilityStrategy is the customer's strategy
This branch in InitializeGatewayConfigurationReaderAsync unconditionally captures whatever is currently in ConnectionPolicy.AvailabilityStrategy into customerConfiguredAvailabilityStrategy. That happens to be true today because InitializePartitionLevelFailoverWithDefaultHedging runs after this method (at DocumentClient.cs:1118), so the only possible source at this point is the customer-supplied ClientOptions.AvailabilityStrategy.
Compare to ApplyHedgingStrategyForCurrentState, which carefully checks IsSDKDefaultStrategyForPPAF before stashing — precisely because the SDK should never persist its own auto-generated strategy as if it were the customer's intent. The init path should mirror that defensive check so a future reorder of OpenAsync doesn't silently stash an SDK-default object and then "restore" it as a phantom customer config when the operator toggles the flag back.
Fix:
if (this.disableCrossRegionalHedging
&& this.ConnectionPolicy.AvailabilityStrategy != null
&& !(this.ConnectionPolicy.AvailabilityStrategy is CrossRegionHedgingAvailabilityStrategy s && s.IsSDKDefaultStrategyForPPAF))
{
this.customerConfiguredAvailabilityStrategy = this.ConnectionPolicy.AvailabilityStrategy;
this.ConnectionPolicy.AvailabilityStrategy = null;
}| "DocumentClient: ApplyHedgingStrategyForCurrentState observed a non-default " + | ||
| "AvailabilityStrategy while a previously-stashed customer strategy is still held; " + | ||
| "the current strategy will be cleared without being stashed. This may indicate a " + | ||
| "re-entrant code path."); |
There was a problem hiding this comment.
🟡 ApplyHedgingStrategyForCurrentState "duplicate non-default strategy" branch silently drops customer config
This else if (!ReferenceEquals(this.customerConfiguredAvailabilityStrategy, currentStrategy)) branch logs a warning and then falls through to ConnectionPolicy.AvailabilityStrategy = null, throwing away the new strategy entirely. The accompanying comment ("re-entrant code path") acknowledges this is a "shouldn't happen" path, but silently losing customer configuration is the worst possible outcome for an operator-driven kill-switch — and it would surface as "hedging mysteriously stopped working after a flag toggle" with no clean diagnostic.
If this path is truly unreachable in production, either:
Debug.Fail(orAssert) in DEBUG so test runs catch the regression immediately, and keep the warning trace for RELEASE, or- Atomically swap: stash the new strategy into
customerConfiguredAvailabilityStrategybefore clearing, so a later toggle-off can restore something instead of the empty/SDK-default.
The current behavior — warn-and-silently-clear — is the worst of both worlds.
| /// 2. Customer explicitly configured a strategy: keep / restore that strategy. | ||
| /// 3. PPAF enabled with no explicit strategy: apply SDK default PPAF hedging. | ||
| /// 4. PPAF disabled: clear any SDK-default strategy that we previously installed. | ||
| /// Caller must hold <see cref="hedgingStrategyLock"/>; the entry-point lock is reentrant so this |
There was a problem hiding this comment.
🟢 <remarks> says "Caller must hold hedgingStrategyLock" but the method itself re-acquires it
The XML doc above the method declares "Caller must hold hedgingStrategyLock; the entry-point lock is reentrant so this..." — but the very next executable line is lock (this.hedgingStrategyLock). The actual contract is "method takes the lock; recursive entry from the (already-locked) UpdatePartitionLevelFailoverConfigWithAccountRefresh caller is fine because monitor locks are reentrant on the same thread." That's a different contract from what the doc states.
Today the only external caller (InitializePartitionLevelFailoverWithDefaultHedging from the init path) does not hold the lock — see F1 — so the wording will mislead a future maintainer into thinking the lock-free init call is a contract violation rather than a real bug. Either:
- Rewrite to: "Acquires
hedgingStrategyLock. Safe to call from withinUpdatePartitionLevelFailoverConfigWithAccountRefreshbecause monitor locks are reentrant.", or - Make the contract real: have the method assume the lock is held, drop the inner
lock, and fix the init-path caller to take it.
|
|
||
| if (ppafEnablementChanged || disableHedgingFlagChanged) | ||
| { | ||
| bool latestPpafEnabled = accountProperties.EnablePartitionLevelFailover |
There was a problem hiding this comment.
🟡 latestPpafEnabled fallback reads SDK-mutated state as if it were Gateway truth
When accountProperties.EnablePartitionLevelFailover.HasValue == false but the hedging flag changed, this falls back to this.connectionPolicy.EnablePartitionLevelFailover to fill in the PPAF arg. That field is mutated by the SDK itself inside DocumentClient.UpdatePartitionLevelFailoverConfigWithAccountRefresh (line ~7045) on every prior fire of this same event. So:
- First fire: gateway says PPAF=true, hedging=false. We update
connectionPolicy.EnablePartitionLevelFailover = true. - Second fire: gateway doesn't include
EnablePartitionLevelFailover(legitimate — operator only changed the hedging flag), and we fall back to the value we ourselves wrote last time, then feed it back into the callback which compares it to itself →ppafEnablementChangedis alwaysfalse.
That's functionally correct today (no PPAF-flip side-effect runs when only hedging changed — exactly what we want), but it conflates "what the gateway said" with "what we last applied". A future writer to connectionPolicy.EnablePartitionLevelFailover (e.g., an admin API) would silently desync the cache. The hedging flag uses a dedicated lastKnownDisableCrossRegionalHedging field for exactly this reason — mirror the pattern: track a lastKnownEnablePartitionLevelFailover and fall back to that instead.
| // the previously-honored state (rather than against an implicit false baseline). | ||
| bool latestDisableHedging = accountProperties.DisableCrossRegionalHedging | ||
| ?? this.lastKnownDisableCrossRegionalHedging; | ||
| this.lastKnownDisableCrossRegionalHedging = latestDisableHedging; |
There was a problem hiding this comment.
🟡 lastKnownDisableCrossRegionalHedging advanced before Invoke — throwing handler leaves GEM/DocumentClient split-brain
Order on this line and the next:
this.lastKnownDisableCrossRegionalHedging = latestDisableHedging; // <-- advance baseline
this.OnEnablePartitionLevelFailoverConfigChanged?.Invoke(...); // <-- then notifyIf the single subscriber (DocumentClient.UpdatePartitionLevelFailoverConfigWithAccountRefresh) throws — e.g., the new hedgingStrategyLock-protected reconcile hits an unexpected AvailabilityStrategy shape and the code in F3 falls into an uncovered branch, or a future caller adds a subscriber that throws — GEM's baseline has already moved. On the next refresh, the gateway still says the same value, disableHedgingFlagChanged evaluates false, no event fires, and DocumentClient.disableCrossRegionalHedging stays at its pre-throw stale value indefinitely (until the gateway flips the flag to a different value to break the diff).
For an operator kill-switch this is exactly the wrong failure mode — once you miss a transition, you can never recover the missed one on retry.
Fix: assign lastKnown after a successful invoke, or wrap with try/catch + revert:
bool previous = this.lastKnownDisableCrossRegionalHedging;
this.lastKnownDisableCrossRegionalHedging = latestDisableHedging;
try
{
this.OnEnablePartitionLevelFailoverConfigChanged?.Invoke(latestPpafEnabled, latestDisableHedging);
}
catch
{
this.lastKnownDisableCrossRegionalHedging = previous; // allow retry next refresh
throw;
}| if (this.ConnectionPolicy.AvailabilityStrategy != null) | ||
| { | ||
| DefaultTrace.TraceInformation( | ||
| "DocumentClient: Hedging re-enabled — applied SDK default PPAF hedging strategy"); |
There was a problem hiding this comment.
🟢 Trace fires only on the success branch — silent no-op leaves operators blind
InitializePartitionLevelFailoverWithDefaultHedging has multiple silent no-op paths: PPAF disabled mid-call, AvailabilityStrategy already non-null (after the customer restore branch above ran), or the flag flipped back to true between callbacks. The trace here only logs when the install succeeded. An operator toggling the kill-switch true → false and then watching logs has no positive confirmation that the SDK ended up in a known state on the no-strategy-installed branch — they will wonder whether hedging is back on, on at a different threshold, or still off.
Add a TraceWarning on the else branch with the observed state (EnablePartitionLevelFailover, AvailabilityStrategy?.GetType().Name, current disableCrossRegionalHedging) so the post-toggle state is always recorded:
if (this.ConnectionPolicy.AvailabilityStrategy != null) { TraceInformation(...applied...); }
else { TraceWarning("DocumentClient: Hedging re-enable did not install a strategy (PPAF={0}, existing={1}, killSwitch={2})", ...); }| // Gateway-driven operator override has absolute precedence over any request-level or | ||
| // client-level AvailabilityStrategy. See spec.md → "Gateway flag disables all hedging | ||
| // when true" and tasks.md item 4.1. | ||
| if (this.client.DocumentClient.IsHedgingDisabledByGateway) |
There was a problem hiding this comment.
💬 Per-request short-circuit's two-field read is not atomic — long comment overstates consistency
This handler does two unrelated reads in sequence:
if (this.client.DocumentClient.IsHedgingDisabledByGateway) return null; // volatile bool
AvailabilityStrategy strategy = request.RequestOptions?.AvailabilityStrategy
?? this.client.DocumentClient.ConnectionPolicy.AvailabilityStrategy; // plain fieldA reader can observe IsHedgingDisabledByGateway == false (snapshot before a refresh) and then ConnectionPolicy.AvailabilityStrategy == null (snapshot after the refresh cleared it). Outcome: no hedging while a strategy was "active from the customer's perspective". That's actually fine for the kill-switch's correctness contract (kill-switch wins, never the other way), but the long XML doc on disableCrossRegionalHedging (line ~31) and the comment on IsHedgingDisabledByGateway (line ~317) imply the per-request view is consistent across both fields. It isn't.
Either:
- Tighten the doc to "the only consistency guarantee across the two reads is: when the volatile read sees
true, hedging is off — the reverse is not symmetric", or - Snapshot
ConnectionPolicy.AvailabilityStrategyto a local before the flag check so the precedence becomes "kill-switch wins regardless of which snapshot we took".
Tightening the doc is the lighter touch and matches actual behavior.
| // to expose this name on Constants.Properties, refactor this attribute to read from | ||
| // Constants.Properties.<NewConstant> for consistency with the other AccountProperties JSON bindings. | ||
| [JsonProperty(PropertyName = "disableCrossRegionalHedging")] | ||
| internal bool? DisableCrossRegionalHedging { get; set; } |
There was a problem hiding this comment.
🟡 Hard-coded "disableCrossRegionalHedging" JSON literal is not pinned by a round-trip test
The TODO above this attribute correctly flags that Constants.Properties is the established source of truth for these names, and the future Direct package update will route through it. In the meantime, the only protection against a typo or a stealth rename is the unit tests — but the new tests in GatewayHedgingOverrideTests.cs only deserialize JSON containing the literal, and the modifications to CosmosJsonSerializerUnitTests.cs / CosmosSerializerCoreTests.cs only assert serializer ordering of the property among siblings. A future PR that renames the field to DisableCrossRegionalHedgingFlag and lockstep-updates the serializer expectation would compile, pass all tests, and silently break wire-binding.
Add one serialize round-trip test (separate from the order-only ones already updated) that pins the exact JSON key:
[TestMethod]
public void DisableCrossRegionalHedging_SerializesWithExactJsonKey()
{
var props = new AccountProperties { DisableCrossRegionalHedging = true };
string json = JsonConvert.SerializeObject(props);
Assert.IsTrue(json.Contains("\"disableCrossRegionalHedging\":true"),
"JSON property name is the wire contract with Gateway and must not change without a coordinated server-side rename.");
}| /// kill-switch wins over both per-request and client-level configuration. | ||
| /// </summary> | ||
| [TestMethod] | ||
| public void RequestInvokerHandler_FlagTrue_PerRequestStrategy_ReturnsNull() |
There was a problem hiding this comment.
💬 RequestInvokerHandler_FlagTrue_PerRequestStrategy_ReturnsNull sets the flag bypassing the production reconcile path
This test uses the new DisableCrossRegionalHedgingForTests = true accessor, which only mutates the volatile bool — it never invokes UpdatePartitionLevelFailoverConfigWithAccountRefresh or ApplyHedgingStrategyForCurrentState. So it validates the RequestInvokerHandler short-circuit in isolation, which is valuable. But the in-code assertion message claims to verify "Gateway operator override must take absolute precedence over per-request AvailabilityStrategy" — that overstates what is being tested: a test-only setter is not the production path the operator actually exercises.
Add a sibling test that:
- Constructs a
DocumentClientwith a customer-configuredAvailabilityStrategy. - Drives the flag via
client.UpdatePartitionLevelFailoverConfigWithAccountRefresh(latestIsEnabled: true, latestDisableCrossRegionalHedging: true)(the actual production callback). - Sends a request with a per-request
RequestOptions.AvailabilityStrategyset. - Asserts the handler returns
null.
That covers the production reconcile + per-request precedence path end-to-end, locking in the contract the current test only asserts in isolation.
Description
Adds a Gateway-controlled
disableCrossRegionalHedgingaccount property that lets on-call engineers dynamically disable PPAF cross-region hedging without rolling back PPAF entirely.When the flag is
true, the SDK clears theAvailabilityStrategyon the next account-properties refresh; toggling it back tofalserestores the customer-configured strategy or rebuilds the SDK default.Type of change
Spec
openspec/changes/ppaf-dynamic-hedging-control/(proposal/design/spec/tasks)Changes
AccountProperties: new internalDisableCrossRegionalHedging(nullable bool) bound to JSON keydisableCrossRegionalHedging.GlobalEndpointManager:OnEnablePartitionLevelFailoverConfigChangedis nowAction<bool, bool>; trackslastKnownDisableCrossRegionalHedgingso refresh fires on either flag change.DocumentClient:GetInitializationTaskAsyncand stashes any customer-configuredAvailabilityStrategywhen the flag is true at startup.InitializePartitionLevelFailoverWithDefaultHedgingshort-circuits when the flag is true.UpdatePartitionLevelFailoverConfigWithAccountRefreshreconciles both PPAF state and the disable flag viaApplyHedgingStrategyForCurrentState.GatewayHedgingOverrideTests.AccountPropertiesJSON shape.Behavior
AvailabilityStrategy=nullAvailabilityStrategy=null(no stash; deterministic)DisablePartitionLevelFailoverClientLevelOverride=trueValidation
This change is a runtime mitigation lever for a production feature (PPAF cross-region hedging), so correctness across the toggle lifecycle and CI green across the full SDK matrix were both treated as gating signals.
Unit-test coverage (10 new tests in
GatewayHedgingOverrideTests)Each row in the Behavior table above has a dedicated test, plus dedicated coverage for the JSON contract and the init-time path:
JSON contract / forward-compat (
AccountProperties)..._DisableCrossRegionalHedging_True_Deserializes-- flagtrueround-trips...._False_Deserializes-- flagfalseround-trips...._Absent_DefaultsToNull-- missing key deserializes asnull(i.e. "no opinion"), preserving existing behavior for older accounts...._Unknown_FieldDoesNotBreakDeserialization-- unknown sibling fields don't break deserialization (forward-compat guard for future Gateway additions).Init-time path (
InitializePartitionLevelFailoverWithDefaultHedging)..._FlagTrue_SkipsApplyingDefaultStrategy-- when the flag istrueat startup, the SDK default hedging strategy is not installed...._FlagFalse_AppliesDefaultStrategy-- when the flag isfalse, the SDK default is installed as before (no regression for the normal path).Refresh-time reconciliation (
UpdatePartitionLevelFailoverConfigWithAccountRefresh)UpdateConfig_FlagToggleOn_StashesCustomerStrategyAndClearsAvailabilityStrategy-- flipping the flag totrueclearsConnectionPolicy.AvailabilityStrategyand stashes the customer's original reference (verified withAreSame, not just structural equality).UpdateConfig_FlagToggleOff_RestoresStashedCustomerStrategy-- flipping back tofalserestores the exact same customer strategy instance and clears the stash, validating the full enable -> disable -> re-enable cycle from task 6.6.UpdateConfig_FlagTrue_DoesNotStashSDKDefaultStrategy-- when only the SDK default is present, the flag clears it but does not stash it; toggling back rebuilds a fresh SDK default. This guards the "stash only customer-configured strategies" invariant called out in the design.UpdateConfig_ClientLevelOverrideDisabled_FlagIsIgnored-- whenDisablePartitionLevelFailoverClientLevelOverride=true, the Gateway flag is ignored end-to-end and the customer strategy is left untouched (client-level override remains absolute).The tests use compile-time-safe
internaltest-only properties (DisableCrossRegionalHedgingForTests,CustomerConfiguredAvailabilityStrategyForTests) instead of reflection, so any future rename of the underlying fields breaks the build rather than silently passing.Contract tests
The two existing tests that pin the
AccountPropertiesJSON shape (CosmosJsonSerializerUnitTests,CosmosSerializerCoreTests) were updated and re-run, confirming the new optional property is the only change to the wire contract.Review-feedback hardening (commit
816e334)A second pass addressed correctness concerns that aren't easy to assert from a unit test but materially raise confidence:
hedgingStrategyLockto serialize the(flag, stash, ConnectionPolicy.AvailabilityStrategy)mutation sequence, so concurrent refresh callbacks cannot interleave stash/clear/restore steps.GetInitializationTaskAsyncintoInitializeGatewayConfigurationReaderAsync, before theGlobalEndpointManagerbackground-refresh loop starts and before the change-event handler is subscribed -- eliminating the theoretical window where a refresh-driven event could be overwritten by a later init-time assignment.UpdatePartitionLevelFailoverConfigWithAccountRefreshno-ops when neither PPAF enablement nor the hedging flag changed, protecting future direct callers from spuriousAvailabilityStrategymutation.isEnabled->latestIsEnabled,disableCrossRegionalHedging->latestDisableCrossRegionalHedging) so a future edit that drops athis.prefix can't silently flip semantics.ApplyHedgingStrategyForCurrentStateemits aTraceWarningif it would silently drop a non-defaultAvailabilityStrategywhile a previously-stashed customer strategy is still held, so this edge case is observable in support logs rather than invisible.CI
All 31 required checks are green on the final commit, including:
Microsoft.Azure.Cosmos.Tests(full unit-test run) andMicrosoft.Azure.Cosmos.Tests Coverage.EmulatorTestsmatrix:Query/ChangeFeed/ReadFeed/Batch/Client Telemetry,MultiMaster,MultiRegion,ThinClient,Others, and theFlakylane.Preview Tests Release,Preview Flag Release,Preview Encryption EmulatorTests Release).EncryptionandEncryption.Customemulator tests (project-ref and package-ref).Static Analysis,CodeQL, and allAnalyze (...)jobs.PerformanceTests ReleaseandCosmosBenchmark(no perf regression in the hot path; the new logic only runs on account-refresh and at init, not per-request).Live long-running soak (~9.3 hours, 837 K ops, 22 toggle cycles)
In addition to the unit/contract tests and CI matrix, we ran the dedicated
PpafHedgingValidatorharness end-to-end against a real PPAF-enabled account in our internal test environment. While the harness ran the workload + fault injection, a separate operator script toggleddisableCrossRegionalHedgingon the account every 15 minutes by alternatingUpdateGlobalDatabaseAccount.ps1 -PropertyValue trueandRemoveGlobalDatabaseAccountConfigurationOverrides.ps1-- exactly the on-call mitigation gesture this PR is designed to support.The 12 h budgeted run was cut short by an OS auto-update at
2026-05-13 03:44 UTCafter 09:18:16 of wall time / 837,303 ops / 22 full toggle cycles captured. The artifacts (samples.csv,events.jsonl,requests.jsonl,report.md,timeline.png) cover the full 9.3 h window.Headline result -- hedging efficacy under live chaos (post-warmup samples, fault windows only)
ConnectionPolicy.AvailabilityStrategy(in-process)disableCrossRegionalHedging = false(hedging enabled)SdkDefault(CrossRegionHedging)disableCrossRegionalHedging = true(hedging disabled)nullWhen the gateway flag flipped to
false, the SDK rebuilt the hedging strategy and hedges fired against the faultedNorth Central USprimary -- cutting p99 by ~750 ms (~51%). When the gateway flag flipped totrue, the strategy cleared and the ~60x drop in hedge count demonstrates the disable path actually disabled hedging. The 65 residual hedges in thenullbucket are in-flight requests from the priorSdkDefaultwindow completing with their pre-flip hedges -- by construction the SDK cannot retroactively cancel those.Toggle-cadence alignment
After SDK warmup (~
20:16 UTConward),strategy_state-- read directly from in-processConnectionPolicy.AvailabilityStrategyvia reflection -- flipped betweenNullandSdkDefaulton a perfect 15-minute cadence, exactly matching the externalStart-Sleep -Seconds 900toggle script. Post-warmup cycle ratio: 52%Null/ 48%SdkDefault(effectively 50/50). Thetimeline.pngflag panel shows the gateway flag and the SDK-observed flag overlay one another across all 36 transitions.Other signals
report.mdtabulates 33 of 36 detected transitions as PASS. The 3 strict-comparator misses (No DocumentQuery Linq to SQL evaluation #4 / skip/take? #8 / Preserve some of the way creating a query worked in v2 #26 -- 48 / 47 / 65 hedges) are allfalse -> trueboundary noise from the in-flight tail described above; the strict numerical comparator does not yet treat a sub-poll-interval in-flight tail as a pass, but the underlying SDK behavior is correct.Validator-side caveats (harness bugs, not product code)
The
GatewayPropertyProbeandSdkStateInspectorreflective probes regressed mid-run, so per-samplegw_flag/sdk_flag/sdk_lastKnowncolumns were stuck at0for the whole run. Transition timestamps inreport.mdare therefore synthesized fromstrategy_state(Null-> flag=true,SdkDefault-> flag=false) by the harness's--rebuild-from-csvpath, which is accurate to the 30 s poll boundary.strategy_stateitself is a direct in-process read ofConnectionPolicy.AvailabilityStrategyand is the authoritative signal -- and as the table above shows, it tracks the external toggle cadence perfectly and yields a decisive separation in both p99 latency and hedge count between the two flag states.